
    Block CUR: Decomposing Matrices using Groups of Columns

    A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. This algorithm has applications in problems as diverse as biometric data analysis and distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. In the biometric setting, the rows correspond to different users and the columns correspond to users' biometric reactions to external stimuli, e.g., watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content, so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments (i.e., blocks) must be presented to the user. The practical application of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment. Comment: shorter version to appear in ECML-PKDD 201
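
    As a rough illustration of the block-sampling idea (not the paper's exact algorithm or guarantees), the sketch below samples column blocks with probability proportional to their squared Frobenius norm and then fits the remaining factor by least squares; all function and parameter names are invented for this example.

```python
import numpy as np

def block_column_approx(A, block_size, num_blocks, rng=None):
    """Approximate A by projecting onto a few sampled column blocks.

    Blocks are sampled proportionally to their squared Frobenius norm,
    a simple importance heuristic standing in for the paper's sampling
    distribution.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    starts = np.arange(0, n, block_size)
    weights = np.array([np.linalg.norm(A[:, s:s + block_size]) ** 2 for s in starts])
    probs = weights / weights.sum()
    chosen = rng.choice(len(starts), size=num_blocks, replace=False, p=probs)
    cols = np.concatenate([np.arange(s, min(s + block_size, n)) for s in starts[chosen]])
    C = A[:, cols]                  # the sampled column blocks
    U = np.linalg.pinv(C) @ A       # best approximation of A within span(C)
    return C, U                     # A is approximated by C @ U

# toy usage
A = np.random.default_rng(0).standard_normal((100, 60))
C, U = block_column_approx(A, block_size=10, num_blocks=3)
rel_err = np.linalg.norm(A - C @ U) / np.linalg.norm(A)
```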

    Optimal CUR Matrix Decompositions

    The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A$, together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a $c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the matrix $A$, that is, $\|A - CUR\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c = O(k/\epsilon)$, $r = O(k/\epsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$. Comment: small revision in lemma 4.
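
    To make the roles of $C$, $U$ and $R$ concrete, here is a baseline randomized CUR (norm-based sampling with $U = C^+ A R^+$ truncated to rank $k$); it is only an illustrative sketch, not the paper's optimal, input-sparsity-time construction, and the sample sizes are arbitrary.

```python
import numpy as np

def baseline_cur(A, c, r, k, rng=None):
    """Sample c columns and r rows by squared-norm importance, then set
    U = pinv(C) @ A @ pinv(R), truncated to rank k. A simple baseline,
    not the paper's optimal construction."""
    rng = np.random.default_rng(rng)
    col_p = np.linalg.norm(A, axis=0) ** 2
    row_p = np.linalg.norm(A, axis=1) ** 2
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p / col_p.sum())
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p / row_p.sum())
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    # enforce rank(U) = k, as in the guarantee stated above
    Uu, Us, Uvt = np.linalg.svd(U, full_matrices=False)
    U_k = (Uu[:, :k] * Us[:k]) @ Uvt[:k, :]
    return C, U_k, R

A = np.random.default_rng(1).standard_normal((300, 200))
C, U, R = baseline_cur(A, c=40, r=40, k=10)
err = np.linalg.norm(A - C @ U @ R, "fro")
```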

    Variant Ranker: a web-tool to rank genomic data according to functional significance

    BACKGROUND: The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation. RESULTS: We describe Variant Ranker, a user-friendly, simple web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants, (ii) perform genotype filtering for case-control samples, (iii) filter large amounts of high-throughput data based on custom filter requirements and apply different models of inheritance, and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (such as pathways, gene ontology and disease categories) and therefore expedite scientific discoveries. We demonstrate the utility of Variant Ranker to identify causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes. CONCLUSIONS: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html . This tool will enable users to prioritize potentially causal variants and is applicable to a wide range of sequencing data.
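
    To illustrate the weighted-aggregation idea in the ranking step, here is a purely hypothetical sketch: the column names, score ranges and weights are invented for this example and do not reflect Variant Ranker's actual schema or scoring.

```python
import pandas as pd

# Hypothetical variant table; columns and values are invented for illustration.
variants = pd.DataFrame({
    "variant":      ["chr1:12345A>G", "chr2:6789C>T", "chr7:555G>A"],
    "prediction":   [0.95, 0.10, 0.70],      # aggregated deleteriousness score in [0, 1]
    "conservation": [0.90, 0.20, 0.75],      # conservation score rescaled to [0, 1]
    "allele_freq":  [0.0001, 0.12, 0.003],   # population allele frequency
})

# User-specified weights (assumed here, not Variant Ranker's defaults).
weights = {"prediction": 0.5, "conservation": 0.3, "rarity": 0.2}

scored = variants.copy()
scored["rarity"] = 1.0 - scored["allele_freq"].clip(upper=1.0)   # rarer variants score higher
scored["rank_score"] = (weights["prediction"] * scored["prediction"]
                        + weights["conservation"] * scored["conservation"]
                        + weights["rarity"] * scored["rarity"])
ranked = scored.sort_values("rank_score", ascending=False)
print(ranked[["variant", "rank_score"]])
```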

    Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming

    Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers. Comment: To appear in NeurIPS 202
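
    The basic primitive behind turnstile-stream bounds like these is a linear sketch of the design matrix: because the sketch is linear, it can be updated entry by entry as the stream arrives. The snippet below maintains such a sketch with a dense random-sign matrix purely for illustration; the paper's constructions use more refined (and sparser) sketches plus additional machinery to recover assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 20, 50              # n points in R^d, sketch dimension m << n

# Dense random-sign sketching matrix; practical turnstile algorithms use
# hash-based sketches so that each update costs O(1) instead of O(m).
S = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

A = np.zeros((n, d))                # kept only to verify linearity below
SA = np.zeros((m, d))               # the sketch maintained by the streaming algorithm

# Turnstile stream: arbitrary additive updates (i, j, delta) to entries of A.
for _ in range(5000):
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.standard_normal()
    A[i, j] += delta
    SA[:, j] += S[:, i] * delta     # O(m) sketch update, no access to A needed

assert np.allclose(SA, S @ A)       # the sketch equals S @ A by linearity
```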

    Randomized Extended Kaczmarz for Solving Least-Squares

    We present a randomized iterative algorithm that converges exponentially in expectation to the minimum Euclidean norm least-squares solution of a given linear system of equations. The expected number of arithmetic operations required to obtain an estimate of given accuracy is proportional to the squared condition number of the system multiplied by the number of non-zero entries of the input matrix. The proposed algorithm is an extension of the randomized Kaczmarz method that was analyzed by Strohmer and Vershynin. Comment: 19 Pages, 5 figures; code is available at https://github.com/zouzias/RE
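
    A compact sketch of the randomized extended Kaczmarz iteration: a column step removes the component of b outside the range of A, followed by a standard Kaczmarz row step. The fixed iteration count and the toy problem below are chosen for illustration rather than taken from the paper's stopping rule.

```python
import numpy as np

def randomized_extended_kaczmarz(A, b, num_iters=20_000, rng=None):
    """Randomized extended Kaczmarz: z tracks the part of b orthogonal to
    range(A); x converges in expectation to the minimum-norm least-squares
    solution of A x = b."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    row_p = np.linalg.norm(A, axis=1) ** 2
    col_p = np.linalg.norm(A, axis=0) ** 2
    row_p, col_p = row_p / row_p.sum(), col_p / col_p.sum()
    x, z = np.zeros(n), b.astype(float).copy()
    for _ in range(num_iters):
        j = rng.choice(n, p=col_p)                        # column step
        z -= (A[:, j] @ z) / (A[:, j] @ A[:, j]) * A[:, j]
        i = rng.choice(m, p=row_p)                        # row step
        x += (b[i] - z[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 30)), rng.standard_normal(200)
x_rek = randomized_extended_kaczmarz(A, b, rng=2)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference solution
```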

    Fast approximation of matrix coherence and statistical leverage

    The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors, and the coherence is the largest leverage score. These quantities are of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation, as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments. Comment: 29 pages; conference version is in ICML; journal version is in JML
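
    For intuition, the sketch below contrasts the exact $O(nd^2)$ computation with a sketch-based approximation in the same spirit, except that it uses dense Gaussian projections rather than the fast Hadamard-based transforms that give the stated $O(nd \log n)$ running time; the sketch sizes are chosen arbitrarily.

```python
import numpy as np

def exact_leverage_scores(A):
    """Squared row norms of an orthonormal basis for range(A): O(n d^2)."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q ** 2, axis=1)

def approx_leverage_scores(A, sketch_rows=200, jl_cols=50, rng=None):
    """Approximate scores via a subspace-embedding sketch plus a JL projection.
    Dense Gaussian matrices are used here for simplicity."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    _, R = np.linalg.qr(S @ A)                 # d x d factor from the sketched matrix
    G = rng.standard_normal((d, jl_cols)) / np.sqrt(jl_cols)
    X = A @ np.linalg.solve(R, G)              # approximates A R^{-1}, then JL-compressed
    return np.sum(X ** 2, axis=1)

A = np.random.default_rng(0).standard_normal((5000, 20))
ell_exact = exact_leverage_scores(A)
ell_approx = approx_leverage_scores(A, rng=1)
```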

    Solving $k$-means on High-dimensional Big Data

    In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with little memory requirement. For the $k$-means problem, this has led to the development of several $(1+\varepsilon)$-approximations (under the assumption that $k$ is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream but also the dimensionality of the input points is high, current methods reach their limits. We propose two algorithms, piecy and piecy-mr, that are based on the recently developed data stream algorithm BICO and that can process high-dimensional data in one pass and output a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that comes in a very long stream. We provide an extensive experimental study to evaluate piecy and piecy-mr that shows the strength of the new algorithms. Comment: 23 pages, 9 figures, published at the 14th International Symposium on Experimental Algorithms - SEA 201
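
    As a rough stand-in for the projection-plus-streaming idea (not the actual piecy/piecy-mr pipeline, which relies on BICO coresets), the sketch below projects each chunk of a high-dimensional stream to a low dimension and runs a simple one-pass online k-means update; all names and parameter values are illustrative.

```python
import numpy as np

def streaming_kmeans_projected(stream, d, k, proj_dim, rng=None):
    """Project each incoming chunk of points to proj_dim dimensions, then run a
    simple one-pass online k-means update (a stand-in for the BICO coreset
    stream used by piecy and piecy-mr)."""
    rng = np.random.default_rng(rng)
    P = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)  # JL-style projection
    centers, counts = None, np.zeros(k)
    for chunk in stream:                       # chunk: (batch, d) array of points
        Y = chunk @ P
        if centers is None:
            centers = Y[:k].copy()             # seed with the first k projected points
            counts[:] = 1
            Y = Y[k:]
        for y in Y:
            j = np.argmin(np.sum((centers - y) ** 2, axis=1))
            counts[j] += 1
            centers[j] += (y - centers[j]) / counts[j]   # running-mean update
    return centers

# toy stream: 20 chunks of 200 points each in 10,000 dimensions
rng = np.random.default_rng(0)
d, k = 10_000, 5
means = rng.standard_normal((k, d)) * 3
stream = (means[rng.integers(k, size=200)] + rng.standard_normal((200, d))
          for _ in range(20))
centers = streaming_kmeans_projected(stream, d=d, k=k, proj_dim=50)
```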

    Near Optimal Linear Algebra in the Online and Sliding Window Models

    We initiate the study of numerical linear algebra in the sliding window model, where only the most recent $W$ updates in a stream form the underlying data set. We first introduce a unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input-sparsity runtime. Our algorithms are based on "reverse online" versions of offline sampling distributions such as (ridge) leverage scores, $\ell_1$ sensitivities, and Lewis weights to quantify both the importance and the recency of a row. Our row-sampling framework rather surprisingly implies connections to the well-studied online model; our structural results also give the first sample-optimal (up to lower order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolve the main open question of Bhaskara et al. (FOCS 2019). We also give the first online algorithm for $\ell_1$-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an additional unified framework for deterministic algorithms using a merge-and-reduce paradigm and the concept of online coresets. Our sampling-based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model that use nearly optimal space.
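
    The "online" half of such a framework can be illustrated with online ridge leverage score row sampling: each arriving row is kept with probability proportional to its leverage against everything seen so far and rescaled for unbiasedness. This is a simplified sketch of that single primitive (with an arbitrary ridge parameter and oversampling factor), not the sliding-window or merge-and-reduce machinery.

```python
import numpy as np

def online_ridge_leverage_sampling(rows, d, lam=1e-3, oversample=10.0, rng=None):
    """Keep each arriving row with probability proportional to its ridge
    leverage score against the rows seen so far; rescale kept rows so the
    sample spectrally approximates the full matrix in expectation."""
    rng = np.random.default_rng(rng)
    M = lam * np.eye(d)                          # running A_seen^T A_seen + lam * I
    kept = []
    for a in rows:
        tau = float(a @ np.linalg.solve(M, a))   # online ridge leverage score
        p = min(1.0, oversample * tau)
        if rng.random() < p:
            kept.append(a / np.sqrt(p))          # rescale for unbiasedness
        M += np.outer(a, a)
    return np.array(kept)

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 15))
S = online_ridge_leverage_sampling(A, d=15)
# S.T @ S should approximate A.T @ A using far fewer rows than A has
```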

    On landmark selection and sampling in high-dimensional data analysis

    In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams. Comment: 18 pages, 6 figures, submitted for publicatio
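
    To ground the Nyström extension being analysed here, the following sketch builds the landmark kernel, eigendecomposes it, and extends the eigenvectors to all points. The RBF kernel, the uniform landmark choice, and all parameter values are assumptions made for illustration; the point of the paper is precisely that smarter landmark selection can do better than this uniform baseline.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) kernel matrix between two point sets."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(X, landmark_idx, gamma, k):
    """Nystrom extension: eigendecompose the small landmark kernel and extend
    its top-k eigenvectors to all n points. Landmark selection is whatever
    index set the caller passes in (uniform, leverage-based, etc.)."""
    L = X[landmark_idx]
    K_mm = rbf_kernel(L, L, gamma)                  # m x m landmark kernel
    K_nm = rbf_kernel(X, L, gamma)                  # n x m cross kernel
    evals, evecs = np.linalg.eigh(K_mm)
    evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]   # top-k eigenpairs
    # extend eigenvectors to all points: u_i proportional to K_nm v_i / lambda_i
    return K_nm @ (evecs / np.maximum(evals, 1e-12))

rng = np.random.default_rng(0)
X = rng.standard_normal((3000, 8))
landmarks = rng.choice(len(X), size=100, replace=False)   # uniform landmarks as a baseline
U = nystrom_embedding(X, landmarks, gamma=0.5, k=10)
```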